author: Raymond Yee (@rdhyee)
last edit: 2017.07.03

Why Care about Quarry?

After expending considerable effort using the Wikipedia API to assemble a database of user edit metadata in support of the work reported in "Gender Differences in Wikipedia Editing" (a prize-winning short paper for WikiSym 2011), I have long been intrigued by the prospect of direct access to the database behind Wikipedia. With direct database access, there would be no need to accumulate data by polling the API, the laborious and error-prone process used for the paper. Just sign up for an account on Wikimedia Labs and connect to the live database replicas. It helps to understand the Mediawiki database layout and to study example MySQL queries that can be run on the replicas. Undoubtedly, there is a learning curve; there's a lot of documentation out there, but I've found it a chore to make sense of it all. (If you get confused, don't forget to ask for help on the Labs-l Mailing List or the wikimedia labs IRC channel.)

In my self-education about the Wikipedia databases, I've been particularly happy to find Quarry, a web application (currently in beta) that allows one to "[r]un SQL queries against Wikipedia & other databases from your browser". You can run SQL queries and see the results in the browser, download the results, and easily share your queries and results with others. You can also see what queries other people have run, from which you can presumably learn much. (Finally, you don't need a shell account on Wikimedia Labs to use the service.)

I've been enjoying Quarry a lot, but I wanted to integrate its functionality with other software. Specifically, I wanted to integrate the work I'm doing on Quarry with the IPython/Jupyter notebook. To begin with, it'd be convenient to be able to programmatically access the output from queries. It turns out that you can make use of an implicit Quarry API to read not only the output data of a query but also its associated metadata, including the query SQL itself, the author, title, and date of execution (see the small example after the list below). Some things I describe here:

  • What queries have I made at Quarry?
  • Specifically, can one tell which ones are published and which are drafts?
  • Can one get data on specific runs?
  • What queries are starred?
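
As a small taste of that implicit API, here is a minimal sketch (self-contained, so it imports requests itself) of reading the metadata endpoint that Quarry's own front end uses; fuller helper functions appear later in this notebook.


In [ ]:
import requests

# minimal sketch: fetch the metadata for a query (here, query 3659, which is
# used again below) from the implicit /query/<id>/meta endpoint
meta = requests.get("http://quarry.wmflabs.org/query/3659/meta").json()
meta['query']['title'], meta['latest_run']['status']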

As I began to write integration code for Quarry and surface my own work there, I started to explore what other people are doing on Quarry. Such questions as:

  • What users are there? Can I get the same info for those folks?
  • What popular queries are there?
  • Can we understand the flow of use -- are people learning from each other?

This notebook shows some preliminary work in that area.

How to learn more about Quarry

BTW, don't miss the main documentation page for Quarry and the three-part Wikiresearch webinar series, which includes videos (1, 2, 3) with some coverage of using Quarry.

Python imports


In [ ]:
%matplotlib inline

In [ ]:
from __future__ import print_function

In [ ]:
import datetime
from itertools import (islice, chain)
import re
import time
from collections import (Counter, OrderedDict)

# writing for eventual Python 2/3 compatibility

try:
    from urllib.parse import urlencode
except ImportError:
    from urllib import urlencode
        
import requests

from lxml.html import fromstring

import matplotlib.pyplot as plt
from pandas import (DataFrame, Series)

from IPython.display import (display, HTML, Markdown)

Working with results from a specific query

As a big fan of the music of J. S. Bach, I'm also very interested in the history of the English Wikipedia page for Johann Sebastian Bach. How did the article develop over the years? When was it most actively edited?

As a simple question to answer using Quarry, I wanted to compute the number of revisions per year for Johann_Sebastian_Bach in enwiki.
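
The SQL for such a count looks something like the following sketch (my hedged reconstruction of the kind of query involved, not necessarily the exact SQL of query 3659, whose actual SQL is retrieved and displayed below):

    SELECT LEFT(rev_timestamp, 4) AS year, COUNT(*) AS count
    FROM revision
    JOIN page ON rev_page = page_id
    WHERE page_namespace = 0
      AND page_title = 'Johann_Sebastian_Bach'
    GROUP BY year
    ORDER BY year;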

The easiest workflow to begin with is to refine a query on Quarry and then download the resultset. That is, for a recent run, I could analyze the output by first downloading the JSON-formatted data from http://quarry.wmflabs.org/run/24183/output/0/json?download=true. But if I were to run the query again, the URL for the output would change, because the run number changes.
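
Such a download can also be scripted directly. Here is a minimal sketch (the run id 24183 comes from the run above; the 'rows' key is the resultset structure this notebook relies on later):


In [ ]:
# minimal sketch: download the JSON resultset of one specific run directly;
# the run id (24183 here) changes on every re-run of the query, which is the
# motivation for the query-id-based helper function below
run_output = requests.get("http://quarry.wmflabs.org/run/24183/output/0/json").json()
run_output['rows'][:5]  # first few rows of the resultset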

A Python function to download information about a Quarry query

For the purposes of analyzing the output from queries in Quarry, instead of manually downloading the output of a Quarry query (and the accompanying metadata), it is ultimately easier to programmatically obtain that output.

Through a combination of reverse engineering Quarry, greatly helped by the open availability of Quarry's source code, I wrote the following Python functions to obtain the output of a query and to display that output in a Jupyter notebook.


In [ ]:
def quarry_result(query_id, rev_id=None, result_format='json'):
    """
    returns a dict {'status', 'output', 'query_meta', 'latest_run', 'latest_rev'}
    for a query_id on quarry.wmflabs.org
    (rev_id is currently unused: for now, assume the latest rev is the one wanted)
    """
    
    BASE_URL = "http://quarry.wmflabs.org"
    
    # get the metadata for the query
    # https://github.com/wikimedia/analytics-quarry-web/blob/e2346c5ec47d63f9514b8aef9190211fa5ff0e45/quarry/web/app.py#L402
    url = BASE_URL + "/query/{0}/meta".format(query_id)
    
    try: 
        query_meta = requests.get(url).json()
        status = query_meta['latest_run']['status']
        latest_run_id = query_meta['latest_run']['id']
        latest_rev_id = query_meta['latest_rev']['id']

        # if status is 'complete' and return_output is True,
        # grab the results -- unless result_format is None
        # https://github.com/wikimedia/analytics-quarry-web/blob/e2346c5ec47d63f9514b8aef9190211fa5ff0e45/quarry/web/app.py#L350

        if (status == 'complete' and result_format is not None):
            # TO DO: figure out whether 0 should be a variable
            # re: https://github.com/wikimedia/analytics-quarry-web/blob/e2346c5ec47d63f9514b8aef9190211fa5ff0e45/quarry/web/app.py#L351
            # 0 is default value for resultset_id 
            url = BASE_URL + "/run/{0}/output/0/{1}".format(latest_run_id, result_format)
            if result_format == 'json':
                output = requests.get(url).json()
            else:
                output = requests.get(url).text
        else:
            output = None

        return {'query_meta':query_meta, 
                'status': status,
                'latest_run': latest_run_id,
                'latest_rev': latest_rev_id,
                'output': output
               }
    except Exception as e:
        return e
    
        
def display_objs_for_q_result(q):
    """
    returns IPython/Jupyter display object to describe query metadata and SQL content
    (first pass)
    """

    description = q['query_meta']['query']['description']
    if description is None:
        description = ""
        
    return ( HTML("<b>{0}</b>".format(q['query_meta']['query']['title'])),
             HTML("<p>id: {0} ({1})</p>".format(q['query_meta']['query']['id'], 
                                                        q['query_meta']['query']['timestamp'])),
             HTML("<p>{0}</p>".format(description)),
             Markdown("""```sql\n{0}\n```""".format(q['query_meta']['latest_rev']['sql']))
            )

Now let's retrieve and display the metadata for query 3659 (http://quarry.wmflabs.org/query/3659):


In [ ]:
q = quarry_result(3659, result_format='json')
display(*display_objs_for_q_result(q))

Show the output from the query


In [ ]:
q['output']

Put the output into a pandas DataFrame


In [ ]:
df = DataFrame(q['output']['rows'], columns=['year', 'count'])
df.set_index(keys='year', inplace=True, drop=True)
df

Make a bar chart of the revision counts for the Bach article by year


In [ ]:
df.plot(kind='bar', title="revision counts for JSB article vs year")

This chart alone brings up many questions for future exploration. Why is there so much activity between 2005 and 2007, followed by a dramatic decrease in 2008? Did authorship activity decline for the Johann Sebastian Bach article alone, or did authorship move to related articles?

Getting all queries for a given username

The query about Bach is only one of the queries I've written on Quarry. Once I could programmatically retrieve a single query, I wanted to interact with all the queries that I had created. To that end, I wrote user_queries to retrieve all the queries for a given user by scraping that user's profile page. For example, http://quarry.wmflabs.org/RaymondYee has all the queries for the user RaymondYee.

(I was hoping that the queries would be available in some type of JSON format, but I couldn't find such a source. The source code for Quarry confirms my suspicion that I need to scrape the profile page for a given user.)


In [ ]:
# types of queries


query_type_headers = OrderedDict([
        ('published', 'Published queries'),
        ('draft', 'Draft Queries'),
        ('starred', 'Starred Queries')
    ])


def user_queries(username):
    """
    get all queries for a user
    e.g., parse http://quarry.wmflabs.org/RaymondYee
    """
    
    url = u"http://quarry.wmflabs.org/{0}".format(username)
    # fetch the profile page once and parse it
    r = requests.get(url)
    page = r.content.decode("UTF-8")
    doc = fromstring(page)
    
    # xpath expressions correlate with template
    # https://github.com/wikimedia/analytics-quarry-web/blob/e2346c5ec47d63f9514b8aef9190211fa5ff0e45/quarry/web/templates/user.html
    
    # number of queries
    queries = dict()
    queries['num_queries'] = int(doc.xpath('//*[@class="user-stat"]/h2/text()')[0])
    

    # loop through all the query types
    for (qtype, qheader) in query_type_headers.items():
        q_elements = doc.xpath('//h3[contains(text(),"{0}")][1]/following-sibling::ul[1]/li/a'.format(qheader))
        q_results = []
        for q in q_elements:
            q_id = int(q.attrib['href'].split('/')[-1])
            q_results.append( (q_id, q.text))
        queries[qtype] = q_results
        
    return queries

Let's now retrieve my queries


In [ ]:
uq = user_queries('RaymondYee')
uq

With a bit more work, we can make it easier to jump back to the original queries on Quarry.


In [ ]:
queries_display = ""
queries_display += "<p>number of queries: {0}</p>".format(uq['num_queries'] )

for (qtype, qheader) in query_type_headers.items():
    queries_display += "<b>{0}</b><br/>".format(qheader)
    for (q_id, q_title) in uq[qtype]:
        queries_display += "<div>{0} <a href='http://quarry.wmflabs.org/query/{0}'>{1}</a></div>".format(q_id, q_title)
    queries_display += "<br/>"
    
HTML(queries_display)

We can print out all the SQL for the queries too. Here's the description and SQL for my published and draft queries.



In [ ]:
ry_results = [quarry_result(q_id, result_format=None) 
    for q_id in chain([q_id for (q_id, title) in uq['published']],
                      [q_id for (q_id, title) in uq['draft']])
              ]

# display the metadata and SQL for each successfully retrieved query
for r in ry_results:
    if not isinstance(r, Exception):
        display(*display_objs_for_q_result(r))

In [ ]:
# problematic queries
# https://quarry.wmflabs.org/query/{q_id/}meta

try:
    from itertools import izip as zip  # Python 2
except ImportError:
    pass  # Python 3: the built-in zip already returns an iterator

q_ids = list(chain([q_id for (q_id, title) in uq['published']],
                      [q_id for (q_id, title) in uq['draft']]))

[(id_) for (id_,r) in zip(q_ids, ry_results) if isinstance(r, Exception)]

One of the next steps I'd like to pursue is figuring out how to programmatically write to Quarry: for example, formulating queries in Python and then sending them to Quarry. As a Wikimedia Labs user, I'd also like to be able to move queries between Quarry and the job submission facility on Labs.

Analyzing queries of all Quarry users collectively

A major part of the appeal of Quarry is that you are part of a community of users creating queries. I personally wanted to understand what others are doing on Quarry.

To satisfy my curiosity, I decided to scrape the Recent Queries page for a list of queries and their creators by writing runs_list. The function runs_list loops over all the queries by paging through the entire history available on Recent Queries.


In [ ]:
def anchor_info_or_text(td):
    anchors = td.xpath('a')
    if anchors:
        # if the href is of the form "/query/<query_id>", return (query_id, anchor text)
        href = anchors[0].attrib['href']
        query_search = re.search(r"/query/(\d+)", href)
        if query_search:
            return (int(query_search.group(1)), anchors[0].text )
        else:
            return (href, anchors[0].text )
    else:
        return td.text
    
def filter_none(d):
    """
    return dict d without any items with None for value
    """
    return dict([(k,v) for (k,v) in d.items() if v is not None])

def runs_list(limit=None, from_=None, _sleep=0):
    """
    Generator for all the queries on http://quarry.wmflabs.org/query/runs/all
    """
    
    url = ("http://quarry.wmflabs.org/query/runs/all?" +
            urlencode(filter_none({'from':from_, 'limit':limit})))    
    more_pages = True

    while more_pages:

        # fetch each page of results once and parse it
        r = requests.get(url)
        page = r.content.decode("UTF-8")
        doc = fromstring(page)

        # grab headers (not used further here, but handy for debugging)
        headers = [th.text for th in doc.xpath("//th")]
        
        # yield rows
        for tr in doc.xpath("//tr[td]"):
            yield [anchor_info_or_text(td) for td in tr]
        
        # next link
        next_links = doc.xpath('//li[@class="next"]/a')
        if next_links:
            url = ("http://quarry.wmflabs.org/query/runs/all?" +
                 next_links[0].attrib['href'])
        else:
            more_pages = False
            
        time.sleep(_sleep)

Let's scrape the pages. As of 2015-05-30, there are few enough queries that I can practically retrieve all the pages returned by Quarry.


In [ ]:
print ("time of analysis:", datetime.datetime.utcnow().isoformat(), "\n")
queries = []

# loop and print out which row we're on

for (i, item) in enumerate(islice(runs_list(_sleep=0.5), None)):
    print("\r {0}".format(i), end="")
    queries.append( (item[0][0], item[0][1], item[1][1], item[2], item[3]))

Display the last five results


In [ ]:
df = DataFrame(queries, columns=['id', 'title', 'creator','status', 'time'])
df.tail()

How many query records belong to me?


In [ ]:
len(df[df.creator=='RaymondYee'])

How many unique query identifiers and creators are there in the data set?


In [ ]:
# how many unique ids

len(df.id.unique()), len(df.creator.unique())

I am surprised by such a small number of unique queries. It's possible my code has a bug that causes me to miss many queries. Alternatively, Quarry might return only a limited selection. (Something to figure out.)
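
One quick diagnostic (a sketch using pandas built-ins) is to check whether the pager re-served rows for query ids we had already seen:


In [ ]:
# diagnostic sketch: count rows whose query id duplicates one seen earlier;
# a large count would point at the pager re-serving rows rather than at
# Quarry hosting only a small number of distinct queries
df.id.duplicated().sum()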


In [ ]:
# loop through users to accumulate data on all user queries
# on the way to computing the most-starred queries

queries_by_user = dict()

for (i, username) in enumerate(islice(df.creator.unique(),None)):
    print ("\r {0}".format(i), end="")
    queries_by_user[username] = user_queries(username)

Now let's compute the queries that have been starred by others and the number of stars they have received.


In [ ]:
# let's look at what is starred

starred_q = Counter()

for (u, d) in queries_by_user.items():
    for q in d['starred']:
        starred_q.update([q])    

starred_q_display = ""

for (q, count) in starred_q.most_common():
    starred_q_display += u"<div><a href='http://quarry.wmflabs.org/query/{0}'>{1}</a>: {2}</div>".format(q[0],q[1],count)

HTML(starred_q_display)

Some Next Steps

There are a lot of other aspects of Quarry to explore, not to mention the Wikimedia databases themselves! Some ideas are:

  • compare SQL code among all the queries: which queries are similar to each other? (a first sketch follows this list)
  • write Python functions to submit jobs directly to Quarry.
  • move jobs between Quarry and the job submission system on Wikimedia Labs.
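
As a first pass at the similarity question (the sketch referenced in the list above), one crude approach is a pairwise ratio from difflib in the standard library, applied to the SQL strings already obtained via quarry_result:


In [ ]:
import difflib

def sql_similarity(sql_a, sql_b):
    """Crude similarity ratio (0.0 to 1.0) between two SQL strings."""
    return difflib.SequenceMatcher(None, sql_a.lower(), sql_b.lower()).ratio()

# e.g., compare the SQL of my own (successfully retrieved) queries pairwise
sqls = [r['query_meta']['latest_rev']['sql'] for r in ry_results
        if not isinstance(r, Exception)]
[(i, j, round(sql_similarity(sqls[i], sqls[j]), 2))
 for i in range(len(sqls)) for j in range(i + 1, len(sqls))]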
